Exercise Sheet 2 Solutions

1.

(a)

Let

\[\begin{split} f(x, y) = \begin{cases} \dfrac{x \sin y}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases} \end{split}\]

We are asked to examine the continuity of $ f $ in $ \mathbb{R}^2 $.

Remark We say that a single-variable function $ f : \mathbb{R} \rightarrow \mathbb{R} $ is continuous at a point $ a \in \mathbb{R} $ if

\[ \lim_{x \to a} f(x) = f(a) \]

Extension to Two Variables Similarly, for a function of two variables $ f : \mathbb{R}^2 \rightarrow \mathbb{R} $, we say that $ f $ is continuous at the point $ (a, b) \in \mathbb{R}^2 $ if

\[ \lim_{(x, y) \to (a, b)} f(x, y) = f(a, b) \]

So, to study the continuity of $ f $, we need to check whether this limit exists and equals the value of the function at that point.

Strategy To determine the existence of

\[ \lim_{(x, y) \to (0, 0)} f(x, y), \]

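we use the fact that, if the limit exists, $ f $ must approach the same value along every path towards $ (0, 0) $. Agreement along a few straight lines does not prove existence, but two paths with different limits disprove it. We therefore compare several directions.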

Direction 1: Along the x-axis We approach $ (0, 0) $ along the x-axis, i.e., set $ y = 0 $.
Then:

\[ f(x, 0) = \frac{x \sin(0)}{x^2 + 0^2} = \frac{0}{x^2} = 0 \quad \text{for all } x \neq 0 \]

\[ \Rightarrow \lim_{x \to 0} f(x, 0) = 0 \]

Direction 2: Along the y-axis Let $ x = 0 $, then:

\[ f(0, y) = \frac{0 \cdot \sin(y)}{0 + y^2} = 0 \]

\[ \Rightarrow \lim_{y \to 0} f(0, y) = 0 \]

Direction 3: Along $ y = x $ Now we approach the origin along a different line, say $ y = x $:

\[ f(x, x) = \frac{x \sin x}{x^2 + x^2} = \frac{x \sin x}{2x^2} = \frac{\sin x}{2x} \]

\[ \Rightarrow \lim_{x \to 0} f(x, x) = \lim_{x \to 0} \frac{\sin x}{2x} = \frac{1}{2} \]

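Since the limit along $ y = x $ is $\frac{1}{2}$, which differs from the limit $ 0 $ along the coordinate axes, the two-dimensional limit

\[ \lim_{(x, y) \to (0, 0)} f(x, y) \]

does not exist, and hence $ f $ is not continuous at the point $ (0, 0) $. At every other point, $ f $ is a quotient of continuous functions with a nonzero denominator, so $ f $ is continuous on $ \mathbb{R}^2 \setminus \{(0, 0)\} $.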
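The path argument can also be checked numerically. The following is a minimal sketch, not part of the original solution: it evaluates $ f $ along the three paths used above for a shrinking parameter `t` (a name chosen here for illustration). Numerical evidence like this only illustrates the analytic result; it does not replace it.

```python
import math

def f(x, y):
    """The function from part (a), with f(0, 0) defined as 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * math.sin(y) / (x**2 + y**2)

# Evaluate f along three paths approaching the origin.
for t in (1e-1, 1e-3, 1e-5):
    along_x_axis = f(t, 0.0)    # path y = 0
    along_y_axis = f(0.0, t)    # path x = 0
    along_diagonal = f(t, t)    # path y = x
    print(f"t={t:g}: {along_x_axis:.6f} {along_y_axis:.6f} {along_diagonal:.6f}")
```

The first two columns stay at zero while the third approaches $ \frac{1}{2} $, matching the limits computed along the axes and along $ y = x $.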

(b)

Partial derivatives of $ f $ at point $ (0, 0) $ If we have a function of two variables

\[ f : \mathbb{R}^2 \rightarrow \mathbb{R}, \quad (x, y) \mapsto f(x, y) \]

then the partial derivative of $ f $ with respect to $ x $ at $ (a, b) $ is defined as:

\[ f_x(a, b) = \lim_{h \to 0} \frac{f(a + h, b) - f(a, b)}{h} \]

and similarly, the partial derivative of $ f $ with respect to $ y $ at $ (a, b) $ is:

\[ f_y(a, b) = \lim_{h \to 0} \frac{f(a, b + h) - f(a, b)}{h} \]

Compute partial derivatives at $ (0, 0) $

- Partial derivative with respect to $ x $:

\[ f_x(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} = \lim_{h \to 0} \frac{0 - 0}{h} = 0 \]

- Partial derivative with respect to $ y $:

\[ f_y(0, 0) = \lim_{h \to 0} \frac{f(0, h) - f(0, 0)}{h} = \lim_{h \to 0} \frac{0 - 0}{h} = 0 \]
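As a quick sanity check, the difference quotients from the definitions can be evaluated for a few step sizes. This is an illustrative sketch only; the helper name `difference_quotients` is an assumption introduced here and does not come from the exercise.

```python
import math

def f(x, y):
    """The function from part (a), with f(0, 0) defined as 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * math.sin(y) / (x**2 + y**2)

def difference_quotients(h):
    """Return the difference quotients for f_x(0, 0) and f_y(0, 0) at step h."""
    q_x = (f(h, 0.0) - f(0.0, 0.0)) / h
    q_y = (f(0.0, h) - f(0.0, 0.0)) / h
    return q_x, q_y

# Positive and negative steps: both quotients are zero for every h != 0.
for h in (1e-1, -1e-3, 1e-6):
    print(h, difference_quotients(h))
```

Both partial derivatives exist at the origin even though $ f $ is not continuous there, because the quotients only probe $ f $ along the coordinate axes.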

(c)

At which points is $ f $ differentiable? To determine where the function $ f $ is differentiable, we use the following theorem:

Remark (Theorem) If $ f $ is continuous on an open set $ U $
and its partial derivatives exist and are continuous on $ U $,
then $ f $ is continuously differentiable (and in particular differentiable) at all points of $ U $.

Let $ U = \mathbb{R}^2 \setminus \{(0, 0)\} $.
The function $ f(x, y) = \dfrac{x \sin y}{x^2 + y^2} $ is continuous at all points in $ U $.

Now we examine the partial derivatives of $ f $:

Compute $ \dfrac{\partial f}{\partial x} $ and $ \dfrac{\partial f}{\partial y} $

\[ \frac{\partial}{\partial x} \left( \frac{x \sin y}{x^2 + y^2} \right) = \frac{(x^2 + y^2)\sin y - 2x^2 \sin y}{(x^2 + y^2)^2} = \frac{(y^2 - x^2)\sin y}{(x^2 + y^2)^2} \]

\[ \frac{\partial}{\partial y} \left( \frac{x \sin y}{x^2 + y^2} \right) = \frac{x \cos y (x^2 + y^2) - 2x y \sin y}{(x^2 + y^2)^2} \]

Each partial derivative is a quotient of continuous functions (polynomials, $ \sin y $ and $ \cos y $) whose denominator $ (x^2 + y^2)^2 $ vanishes only at the origin $ (0, 0) $.
Thus, the partial derivatives are continuous everywhere in $ U $.

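Conclusion By the theorem, $ f $ is differentiable at every point of $ U $. At the origin, $ f $ is not continuous (part (a)), and a function that is differentiable at a point must be continuous there, so $ f $ is not differentiable at $ (0, 0) $. Hence $ f $ is differentiable at all points except the origin.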

2.

(a)

Let the function $ f(z) = \exp\left(-\dfrac{1}{2} z\right) $,
where $ z = g(y) = y^\top S^{-1} y $,
and $ y = h(x) = x - u $,
with:

$ x, u \in \mathbb{R}^D $
$ S \in \mathbb{R}^{D \times D} $

Chain Rule Based on the chain rule, we have:

\[ \frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dy} \cdot \frac{dy}{dx} \]

Step 1: Note the functions and their domains

$ y = h(x) = x - u $ → maps $ \mathbb{R}^D \to \mathbb{R}^D $
$ z = g(y) = y^\top S^{-1} y $ → maps $ \mathbb{R}^D \to \mathbb{R} $
$ f(z) = e^{- \frac{1}{2} z} $ → maps $ \mathbb{R} \to \mathbb{R} $

So the full composition is:

\[ x \mapsto y = x - u \mapsto z = y^\top S^{-1} y \mapsto f(z) = e^{- \frac{1}{2} z} \]

Step 2: Compute $ \dfrac{dy}{dx} $ Since $ y = x - u $, the Jacobian $ \dfrac{dy}{dx} $ is:

\[ \frac{dy}{dx} = I_{D \times D} \quad \text{(identity matrix)} \]

Step 3: Compute $ \dfrac{dz}{dy} $ We have $ z = y^\top S^{-1} y $.
Using gradient rules for quadratic forms:

\[ \frac{d}{dy} (y^\top A y) = y^\top (A + A^\top) \]

Apply this:

\[ \frac{dz}{dy} = y^\top (S^{-1} + (S^{-1})^\top) \quad \in \mathbb{R}^{1 \times D} \]

Step 4: Compute $ \dfrac{df}{dz} $

\[ f(z) = e^{- \frac{1}{2} z} \quad \Rightarrow \quad \frac{df}{dz} = -\frac{1}{2} e^{- \frac{1}{2} z} \quad \in \mathbb{R} \]

Final Result

\[ \frac{df}{dx} = -\frac{1}{2} e^{- \frac{1}{2} z} \cdot y^\top (S^{-1} + (S^{-1})^\top) \quad \in \mathbb{R}^{1 \times D} \]
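The formula can be compared against central finite differences. The sketch below is an illustration under stated assumptions: the dimension $ D = 2 $, the matrix `S_inv` (standing in for $ S^{-1} $, deliberately non-symmetric so the general formula is exercised), the vector `u`, and the test point `x0` are all made-up example values, not data from the exercise.

```python
import math

# Hypothetical example data (not from the exercise): D = 2.
S_inv = [[2.0, 0.5],
         [0.0, 1.0]]
u = [1.0, -1.0]

def f(x):
    """f(x) = exp(-0.5 * (x - u)^T S_inv (x - u))."""
    y = [xi - ui for xi, ui in zip(x, u)]
    z = sum(y[i] * S_inv[i][j] * y[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * z)

def analytic_gradient(x):
    """Row vector -0.5 * exp(-0.5 z) * y^T (S_inv + S_inv^T)."""
    y = [xi - ui for xi, ui in zip(x, u)]
    scale = -0.5 * f(x)
    return [scale * sum(y[i] * (S_inv[i][j] + S_inv[j][i]) for i in range(2))
            for j in range(2)]

def numeric_gradient(x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(2):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

x0 = [0.5, 0.25]
print(analytic_gradient(x0))
print(numeric_gradient(x0))
```

The two printed vectors should agree up to finite-difference error. If $ S $ is symmetric, then $ S^{-1} + (S^{-1})^\top = 2 S^{-1} $ and the result simplifies to $ -e^{-\frac{1}{2} z}\, y^\top S^{-1} $.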

(b)

Let

\[ f(z) = \tanh(z), \quad z = Ax + b \]

where:

$ x \in \mathbb{R}^N $
$ A \in \mathbb{R}^{M \times N} $
$ b \in \mathbb{R}^M $

Apply Chain Rule

\[ \frac{df}{dx} = \frac{df}{dz} \cdot \frac{dz}{dx} \]

Step 1: Understand $ z = Ax + b $ We note:

\[ z = Ax + b \in \mathbb{R}^M \Rightarrow \frac{dz}{dx} = A \in \mathbb{R}^{M \times N} \]

Step 2: Compute $ \dfrac{df}{dz} $ We have:

\[\begin{split} f(z) = \begin{bmatrix} \tanh(z_1) \\ \tanh(z_2) \\ \vdots \\ \tanh(z_M) \end{bmatrix} \end{split}\]

So the Jacobian of $ f $ is diagonal:

\[\begin{split} \frac{df}{dz} = \begin{bmatrix} \text{sech}^2(z_1) & & \\ & \ddots & \\ & & \text{sech}^2(z_M) \end{bmatrix} \in \mathbb{R}^{M \times M} \end{split}\]

Final Result

\[ \frac{df}{dx} = \operatorname{diag}\left( \text{sech}^2(z_1),\ \text{sech}^2(z_2),\ \dots,\ \text{sech}^2(z_M) \right) \cdot A \quad \in \mathbb{R}^{M \times N} \]
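A similar finite-difference comparison works for this Jacobian. The sketch below assumes small made-up values for `A`, `b` and the test point `x0` (with $ M = 2 $, $ N = 3 $), none of which come from the exercise, and uses the identity $ \text{sech}^2(z) = 1 - \tanh^2(z) $.

```python
import math

# Hypothetical example data (not from the exercise): M = 2, N = 3.
A = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 3.0]]
b = [0.1, -0.2]

def affine(x):
    """z = A x + b."""
    return [sum(A[i][j] * x[j] for j in range(3)) + b[i] for i in range(2)]

def f(x):
    """Elementwise tanh of A x + b."""
    return [math.tanh(zi) for zi in affine(x)]

def analytic_jacobian(x):
    """diag(sech^2(z)) times A: row i of A scaled by 1 - tanh(z_i)^2."""
    z = affine(x)
    return [[(1.0 - math.tanh(z[i]) ** 2) * A[i][j] for j in range(3)]
            for i in range(2)]

def numeric_jacobian(x, h=1e-6):
    """Central finite differences, one input coordinate (column) at a time."""
    jac = [[0.0] * 3 for _ in range(2)]
    for j in range(3):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(2):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac

x0 = [0.2, -0.1, 0.05]
print(analytic_jacobian(x0))
print(numeric_jacobian(x0))
```

Multiplying by the diagonal matrix on the left simply scales row $ i $ of $ A $ by $ \text{sech}^2(z_i) $, so the full $ M \times M $ matrix never needs to be formed.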

3.

(a)

Let

\[\begin{split} x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \eta = 1 \end{split}\]

and perform two steps of gradient descent.

The update rule for gradient descent is:

\[ x^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)}) \]

So two steps of the gradient descent algorithm are:

\[ \text{Step 1:} \quad x^{(1)} = x^{(0)} - \eta \nabla f(x^{(0)}) \]

\[ \text{Step 2:} \quad x^{(2)} = x^{(1)} - \eta \nabla f(x^{(1)}) \]

Given the gradient:

\[\begin{split} \nabla f = \begin{bmatrix} x_1 + 2 \\ 2x_2 + 1 \end{bmatrix} \end{split}\]

We compute:

\[\begin{split} x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \Rightarrow \quad \nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \end{split}\]

Step 1:

\[\begin{split} x^{(1)} = x^{(0)} - 1 \cdot \nabla f(x^{(0)}) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} -2 \\ -1 \end{bmatrix} \end{split}\]

\[\begin{split} \nabla f(x^{(1)}) = \begin{bmatrix} -2 + 2 \\ -2 + 1 \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \end{bmatrix} \end{split}\]

Step 2:

\[\begin{split} x^{(2)} = x^{(1)} - 1 \cdot \nabla f(x^{(1)}) = \begin{bmatrix} -2 \\ -1 \end{bmatrix} - \begin{bmatrix} 0 \\ -1 \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \end{bmatrix} \end{split}\]
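The two steps above can be reproduced with a few lines of code. This is a minimal sketch that uses only the gradient stated in the exercise; the function names `grad_f` and `gradient_descent` are chosen here for illustration.

```python
def grad_f(x):
    """Gradient given in the exercise: (x1 + 2, 2*x2 + 1)."""
    return [x[0] + 2.0, 2.0 * x[1] + 1.0]

def gradient_descent(x0, eta, steps):
    """Return the list of iterates x^(0), ..., x^(steps)."""
    iterates = [list(x0)]
    for _ in range(steps):
        x = iterates[-1]
        g = grad_f(x)
        # Update rule: x <- x - eta * grad f(x)
        iterates.append([xi - eta * gi for xi, gi in zip(x, g)])
    return iterates

for i, x in enumerate(gradient_descent([0.0, 0.0], eta=1.0, steps=2)):
    print(i, x)
```

The printed iterates match the hand computation: $ (0, 0) $, $ (-2, -1) $ and $ (-2, 0) $.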

(b)

Will the gradient descent procedure from part (a) converge to the minimizer $ x^* $? Why or why not? How can we fix it?

Let’s look at the values over iterations:

\[\begin{split} x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad x^{(1)} = \begin{bmatrix} -2 \\ -1 \end{bmatrix}, \quad x^{(2)} = \begin{bmatrix} -2 \\ 0 \end{bmatrix}, \quad x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix} \end{split}\]

And:

\[\begin{split} \nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \nabla f(x^{(1)}) = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad \nabla f(x^{(2)}) = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad \nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \end{split}\]

We observe that gradient descent does not converge to $ x^* $. Why?

Because the gradient does not shrink toward zero: its $ x_2 $ component keeps the same magnitude and only flips sign from one step to the next.
Let’s examine the partial derivatives:

\[ \frac{\partial f}{\partial x_1} \big|_{x^{(0)}} = 2, \quad \frac{\partial f}{\partial x_1} \big|_{x^{(1)}} = 0, \quad \frac{\partial f}{\partial x_1} \big|_{x^{(2)}} = 0 \]

\[ \frac{\partial f}{\partial x_2} \big|_{x^{(0)}} = 1, \quad \frac{\partial f}{\partial x_2} \big|_{x^{(1)}} = -1, \quad \frac{\partial f}{\partial x_2} \big|_{x^{(2)}} = 1 \]

Since $ x^* $ is a minimizer with $ \nabla f(x^*) = 0 $, convergence to $ x^* $ requires the partial derivatives to shrink toward zero along the iterates.

Here, however, every step overshoots the minimum in the $ x_2 $ direction because the learning rate $ \eta = 1 $ is too large, so the iterates keep jumping back and forth across it. Decreasing the learning rate fixes this.
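As a short worked check (a sketch derived only from the gradient given in the exercise and the minimizer $ x^* = (-2, -0.5)^\top $), subtracting $ x^* $ from the update rule shows how the error in each coordinate evolves:

\[ x_1^{(i+1)} + 2 = (1 - \eta)\left(x_1^{(i)} + 2\right), \qquad x_2^{(i+1)} + \tfrac{1}{2} = (1 - 2\eta)\left(x_2^{(i)} + \tfrac{1}{2}\right) \]

For $ \eta = 1 $ the factors are $ 0 $ and $ -1 $: the $ x_1 $ error vanishes after one step, while the $ x_2 $ error only changes sign, so the iterates alternate between $ x_2 = -1 $ and $ x_2 = 0 $ forever. Both errors shrink exactly when $ |1 - \eta| < 1 $ and $ |1 - 2\eta| < 1 $, that is, for $ 0 < \eta < 1 $.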

Trying smaller learning rates: Let’s try $ \eta = 0.5 $:

\[\begin{split} x^{(0)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \end{split}\]

Step 1:

\[\begin{split} x^{(1)} = x^{(0)} - 0.5 \cdot \nabla f(x^{(0)}) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} - 0.5 \cdot \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} \end{split}\]

\[\begin{split} \nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \end{split}\]

Step 2:

\[\begin{split} x^{(2)} = x^{(1)} - 0.5 \cdot \nabla f(x^{(1)}) = \begin{bmatrix} -1 \\ -0.5 \end{bmatrix} - 0.5 \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -0.5 \end{bmatrix} \end{split}\]

Now we see that the GD algorithm converges towards:

\[\begin{split} x^* = \begin{bmatrix} -2 \\ -0.5 \end{bmatrix} \end{split}\]

with gradients:

\[\begin{split} \nabla f(x^{(0)}) = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \nabla f(x^{(1)}) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \nabla f(x^{(2)}) = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}, \quad \nabla f(x^*) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \end{split}\]

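So with the smaller learning rate $ \eta = 0.5 $ the iterates approach $ x^* $: the $ x_2 $ coordinate already sits at its optimal value $ -0.5 $ after one step, and the $ x_1 $ component of the gradient halves at every step.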
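To see the effect over more than two iterations, the distance to $ x^* $ can be tracked for several learning rates. The sketch below is illustrative: the helper `distance_after`, the step count of 20 and the extra rate $ \eta = 0.25 $ are assumptions made here, not part of the exercise.

```python
import math

X_STAR = (-2.0, -0.5)  # minimizer stated in the exercise

def distance_after(eta, steps, x0=(0.0, 0.0)):
    """Run gradient descent and return the Euclidean distance to X_STAR."""
    x1, x2 = x0
    for _ in range(steps):
        # Gradient from the exercise: (x1 + 2, 2*x2 + 1)
        x1, x2 = x1 - eta * (x1 + 2.0), x2 - eta * (2.0 * x2 + 1.0)
    return math.hypot(x1 - X_STAR[0], x2 - X_STAR[1])

for eta in (1.0, 0.5, 0.25):
    print(eta, distance_after(eta, steps=20))
```

With $ \eta = 1 $ the distance stays stuck at $ 0.5 $ because of the oscillation in $ x_2 $, whereas for the smaller rates it decays toward zero.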
